Similarity analysis with advanced relationships on big data
نویسنده
چکیده
Similarity analytic techniques such as distance based joins and regularized learningmodels are critical tools employed in numerous data mining and machine learning tasks. We focus on two typical techniques in the context of large scale data and distributed clusters. Advanced distance metrics such as the Earth Mover’s Distance (EMD) are usually employed to capture the similarity between data dimensions. The high computational cost of EMD calls for a distributed solution, yet it is difficult to achieve a balanced workloads given the skewed distribution of the EMDs. We propose efficient bounding techniques and effective workload scheduling strategies on the Hadoop platform to design a scalable solution, named HEADS-JOIN. We investigate both the range joins and the top-k joins, and explore different computation paradigms including MapReduce, BSP, and Spark. We conduct comprehensive experiments and confirm that the proposed techniques achieve an order of magnitude speedup over the state-of-the-art MapReduce join algorithms. The hypergraph model is demonstrated to achieve excellent effectiveness in a wide range of applications where high-order relationships are of interest. When processing a large scale hypergraph, one widely-adopted approach is to convert it to a graph and reuse the distributed graph frameworks. However, such an approach significantly increases the problem size, incurs excessive replicas due to partitioning, and renders it extremely difficult to achieve a balanced workloads. We propose a novel scalable framework, named HyperX, to directly operate on a distributed hypergraph representation and minimize the numbers of replicas while still maintaining a great workload balance among the distributed machines. We closely investigate an optimization problem of partitioning a hypergraph in the context of distributed computation. With extensive experiii iments, we confirm that HyperX achieve an order of magnitude improvement over the graph conversion approach in terms of the execution time, network communication, and memory consumption.
منابع مشابه
A Fuzzy TOPSIS Approach for Big Data Analytics Platform Selection
Big data sizes are constantly increasing. Big data analytics is where advanced analytic techniques are applied on big data sets. Analytics based on large data samples reveals and leverages business change. The popularity of big data analytics platforms, which are often available as open-source, has not remained unnoticed by big companies. Google uses MapReduce for PageRank and inverted indexes....
متن کاملMining Essential Relationships under Multiplex Networks
In big data times, massive datasets often carry different relationships among the same group of nodes, analyzing on these heterogeneous relationships may give us a window to peek the essential relationships among nodes. In this paper, first of all we propose a new metric “similarity rate” in order to capture the changing rate of similarities between node-pairs though all networks; secondly, we ...
متن کاملTourMiner: Effective and Efficient Clustering Big Mobile Social Data for Supporting Advanced Analytics Tools
Nowadays a great deal of attention is devoted to the issue of supporting big data analytics over big mobile social data. These data are generated by modern emerging social systems like Twitter, Facebook, Instagram, and so forth. Mining big mobile social data has been of great interest, as analyzing such data is critical for a wide spectrum of big data applications (e.g., smart cities). Among se...
متن کاملAnalysis of Patient Groups and Immunization Results Based on Subspace Clustering
Biomedical experts are increasingly confronted with what is often called Big Data, an important subclass of high-dimensional data. High-dimensional data analysis can be helpful in finding relationships between records and dimensions. However, due to data complexity, experts are decreasingly capable of dealing with increasingly complex data. Mapping higher dimensional data to a smaller number of...
متن کاملA Geometric View of Similarity Measures in Data Mining
The main objective of data mining is to acquire information from a set of data for prospect applications using a measure. The concerning issue is that one often has to deal with large scale data. Several dimensionality reduction techniques like various feature extraction methods have been developed to resolve the issue. However, the geometric view of the applied measure, as an additional consid...
متن کامل